ABSTRACT

This work focuses on top-k recommendation in domains where the
underlying data distribution shifts over time. We propose to learn
a time-dependent bias for each item on top of any existing
recommendation engine. Such a bias learning process alleviates data
sparsity in constructing the engine, and at the same time captures
recent trend shifts observed in the data. We present an alternating
optimization framework to solve the bias learning problem, and
develop methods to handle a variety of commonly used recommendation
evaluation criteria, as well as the large numbers of items and users
encountered in practice. The proposed algorithm is examined, both
offline and online, using real-world data sets collected from a
retailer website. Empirical results demonstrate that bias learning
almost always boosts recommendation performance. We encourage other
practitioners to adopt it as a standard component in recommender
systems where temporal dynamics are the norm.

Categories and Subject Descriptors 

H.2.8 [Information Technology and Systems]: Database
Applications - Data Mining; H.3.3 [Information Storage
and Retrieval]: Information Filtering

General Terms 

Algorithms, Experimentation, Performance 

Keywords 

bias learning, temporal dynamics, top-k recommendation 

1. INTRODUCTION

Recommender systems have been extensively studied in different
domains including eCommerce [^], movie/music ratings [^], news
personalization [^], content recommendation at web portals [^], etc.
All sorts of methods have been proposed for recommendation [^],
including content-based methods, neighborhood-based approaches [^],
and latent factor models like SVD or matrix factorization [^]. We
notice that many methods assume that data in recommender systems is
static or follows the same distribution. Hence, a significant body
of existing work evaluates the proposed models following a
cross-validation procedure that randomly hides some entries in a
user-item matrix for testing.

Figure 1: Rank changes over the past year of the top 10 popular
products in early September 2013. The relative ranking of products
was determined by sales volume. The number on each curve represents
the product ranking in early September. For legibility, 0 indicates
10. (Better viewed in color)

However, this is not the case for recommender systems where user
feedback is collected over time. In this paper, we take one
real-world eCommerce website (denoted as XYZ) as an example. The
fundamental task is to recommend relevant items to customers,
hopefully increasing purchases and thus revenues and profits for the
company in the long run. We observe that the trending products
change week by week, or even day by day, due to changes in user
interest, demand shifts ignited by external events, or simply
because a product has gone out of inventory or has just been put on
the shelf. Figure 1 demonstrates the fluctuation of the relative
ranking (determined by sales volume) of the top 10 popular products
at XYZ identified in the first week of September 2013.^1 Notice that
some portions of the lines are missing for certain weeks in the
figure, indicating no transactions at the corresponding moments.
Among the 10 products, only 6 were sold one year ago. The other 4
products were not even sold on the website yet. The variance is
huge, as observed in the figure. For instance, the 3rd most popular
product was ranked below 10,000 just a few weeks earlier. The
majority of the selected products were outside the top 10 for most
of the past year.

^1 All privacy-sensitive items, service-oriented products, and
warranties are removed for analysis.

Figure 2: Relative improvement of recommendation with different
time windows for model training

The temporal dynamics described above pose thorny challenges for
recommendation, because they violate the fundamental assumption in
most collaborative filtering methods that training and test data
share the same distribution. Simply ignoring this discrepancy leads
to a bad user experience. For instance, we may recommend a product
that was popular one year ago but has since been replaced by new
models. This is particularly common for products on promotion during
the holiday season. Another naive solution to the discrepancy is to
restrict model training to transactions from the past few days only,
so that the test distribution does not differ too much from that of
training. However, that essentially discards plenty of past
transaction information, leading to a more severe data sparsity
problem.

Data sparsity has been widely observed in many different domains.
At XYZ, for example, the majority of customers purchase only a few
items across a whole year, and the majority of products have a
relatively small number of transactions. It is not a wise choice to
discard data for model training. With fewer data samples, the
estimates of the model parameters suffer from high variance.
Moreover, a shorter time window typically results in a smaller
coverage of items that can be leveraged for recommendation, since
some items may not appear in the selected window. In Figure 2, we
plot the relative performance improvement of a Markov model^2 when
we expand the training time window to the past 1 month, 3 months,
6 months and 15 months respectively. The baseline is the model
trained using the past 1 week of transactions only. The larger the
time window (and thus the more data) we use for training, the better
the performance, implying that we should collect as much data for
training as possible.

Hence we have the following dilemma: on one hand, we want to
exploit all possible transactions to overcome sparsity; on the
other hand, we wish to capture the trend change in recommendation
by examining recent transactions only. How can we capture temporal
dynamics without discarding data for training? In this work, we
propose to learn item-specific biases in order to capture the
temporal trend change for recommendation, while the recommendation
model itself is still constructed using as much data as possible. We
define a bias learning problem and then propose an algorithm to
optimize biases for a given evaluation criterion of top-k
recommendation. Its convergence properties and time complexity are
carefully examined. Experiments demonstrate that the bias learning
process almost always improves the performance of the base
recommendation model, by keeping pace with the temporal dynamics of
user-item interactions.

^2 Details and evaluation criteria are described later.


2. TOP-k SELECTION AND EVALUATION

Before proceeding to the bias learning problem, we review the
commonly used procedure in top-k recommendation and the
corresponding evaluation measures. Without loss of generality, we
adopt the standard terms user and item to speak about the
recommendation problem throughout the paper. We assume there are in
total n users and m items; u, v denote the indices of users, i, j
the indices of items, and p, q the rank positions of items in the
top-k. More symbols and their meanings are listed in Table 1.

2.1 Top-k Selection

In recommender systems, the number of items to recommend is
typically bounded by a certain number. For example, in email
marketing, the number of items that can be recommended is limited by
the email template. Assuming that we need to select k items for a
given user u, a common practice is to have a recommendation engine
yield a prediction score for each item as follows:

$$\mathbf{f}_{u*} = (f_{u1}, f_{u2}, \cdots, f_{um}) \tag{1}$$


where $f_{ui}$ denotes the prediction score of item i for user u.
Note that some recommendation models yield scores only for a select
set of candidate items, with the others defaulting to 0. Items are
then ranked according to their prediction scores and the top-ranking
ones are selected.^3 Equivalently, we find an ordered set of k items
with maximal scores:

$$\pi_k(\mathbf{f}_{u*}) = \big(j(1), j(2), \cdots, j(k)\big) \quad \text{s.t.} \quad f_u^{(p)} \ge f_{ui}, \;\; \forall p \in \{1, 2, \cdots, k\}, \; i \notin \pi_k(\mathbf{f}_{u*}) \tag{2}$$

where $f_u^{(p)} = f_{u j(p)}$ is the score of the top p-th item.
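
The selection step can be sketched in a few lines of Python. This is only an illustration: the function name `top_k` and the tie-breaking rule (lower item index first) are assumptions, since the text does not specify how ties are resolved.

```python
import numpy as np

def top_k(scores, k):
    """Return the indices of the k highest-scoring items, best first.

    `scores` plays the role of the score vector f_{u*} of one user.
    Ties are broken by lower item index (an assumption of this sketch).
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # descending by score
    return order[:k]

# Example: four items, keep the best two.
selected = top_k([0.1, 0.9, 0.0, 0.5], k=2)
```

Items that a model does not score can simply carry a score of 0 in the input vector, matching the default mentioned above.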


2.2 Evaluation Criteria 

As for evaluation, the top-ranking items $\pi_k(\mathbf{f}_{u*})$
and the relevance of items have to be provided. Different from
typical ratings in collaborative filtering, we focus on
recommendations where there are only binary responses: relevant or
irrelevant. In our application, we deem an item relevant for a user
when the user purchases the item. Let $y_{ui} \in \{0, 1\}$ denote
the relevance of item i for a particular user u, with 0 being
irrelevant and 1 relevant. For presentation convenience, we use
$y_u^{(p)}$ to denote the relevance of the p-th top-ranking item
for u.

For top-k recommendation, standard evaluation criteria include
accuracy (ACC), mean average precision (MAP), and normalized
discounted cumulative gain (NDCG). Since all of them compute
performance with respect to the top-k recommendation of an
individual user and then average across all users, we describe them
with respect to one user.

^3 We assume items are selected solely based on prediction scores,
while researchers have been considering other factors like
diversity [^], which is beyond the scope of this work.













Accuracy measures how many items in the top-k recommendation are
indeed relevant:

$$ACC@k(u) = \frac{1}{k} \sum_{p=1}^{k} y_u^{(p)} \tag{3}$$


Accuracy does not care about the position of items once they enter
the top-k recommendation. By contrast, the other two measures,
average precision (AP@k) and normalized discounted cumulative gain
(NDCG@k), take into account item position in the top-k
recommendation as well.

The precision up to rank p (denoted as Precision(p, u)) is
essentially ACC@p(u). AP is the average of the precision at the
positions of the relevant items:

$$AP@k(u) = \frac{\sum_{p=1}^{k} y_u^{(p)} \cdot Precision(p, u)}{\min\{k, \#\text{ relevant items for } u\}} \tag{4}$$


The denominator in Eq. (4) accounts for cases where users are
associated with different numbers of relevant items. Average
precision is 1 when all relevant items are ranked at the top. When
averaging AP@k over all users, we obtain mean average precision
(MAP@k).
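
As a worked illustration of Eqs. (3) and (4), the following Python sketch computes ACC@k and AP@k for one user from the relevance of the ranked items. The function names and the convention of returning 0 for a user without relevant items are assumptions made for this example.

```python
def acc_at_k(ranked_rel, k):
    """ACC@k: fraction of the top-k items that are relevant (Eq. 3)."""
    return sum(ranked_rel[:k]) / k

def ap_at_k(ranked_rel, k, n_relevant):
    """AP@k: precision averaged over the positions of relevant items (Eq. 4).

    ranked_rel[p - 1] is the 0/1 relevance of the item at rank p, and
    n_relevant is the total number of relevant items for the user.
    """
    hits, total = 0, 0.0
    for p, rel in enumerate(ranked_rel[:k], start=1):
        if rel:
            hits += 1
            total += hits / p        # Precision(p, u) at a relevant position
    denom = min(k, n_relevant)
    return total / denom if denom else 0.0   # assumed convention for no relevant items

# Relevant items at ranks 1 and 3 out of 5 recommended; the user has 2 relevant items.
acc = acc_at_k([1, 0, 1, 0, 0], k=5)               # 2 / 5
ap = ap_at_k([1, 0, 1, 0, 0], k=5, n_relevant=2)   # (1/1 + 2/3) / 2
```

Moving the second relevant item from rank 3 to rank 2 leaves ACC@5 unchanged but raises AP@5 to 1, which is the position sensitivity described above.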

Discounted cumulative gain (DCG) was initially proposed for
rankings with different degrees of relevance:

$$DCG@k(u) = \sum_{p=1}^{k} \frac{2^{y_u^{(p)}} - 1}{\log(1 + p)} = \sum_{p=1}^{k} \frac{y_u^{(p)}}{\log(1 + p)} \tag{5}$$


The second equality in Eq. (5) follows because
$y_u^{(p)} \in \{0, 1\}$ in our case. However, DCG generally varies
with respect to k and the number of relevant items for u, making
comparison difficult. Hence, normalized discounted cumulative gain
(NDCG) is proposed to normalize the DCG into [0, 1]:


$$NDCG@k(u) = \frac{1}{Z_{ku}} \sum_{p=1}^{k} \frac{y_u^{(p)}}{\log(1 + p)} \tag{6}$$


where $Z_{ku}$ is the maximal DCG, i.e., the DCG when all relevant
items are ranked at the top.

Once we have the performance measure (ACC@k, AP@k, or NDCG@k) for
each individual user, the overall performance for a set of users
$\mathcal{U}$ can be computed as the mean:

$$perf@k(\mathcal{U}) = \frac{1}{n} \sum_{u=1}^{n} perf@k(u). \tag{7}$$

Since the ranking of items is determined by the score predictions,
and so is the corresponding performance, for presentation
convenience we can rewrite Eq. (7) as

$$perf(\mathbf{f}_{1*}, \mathbf{f}_{2*}, \cdots, \mathbf{f}_{n*}) = perf@k(\mathcal{U}),$$

where $\mathbf{f}_{u*}$ denotes the prediction scores for user u.

Table 1: Nomenclature

Symbol(s)             Representation
n                     total number of users
m                     total number of items
k                     number of items to recommend
u, v                  index number of users
i, j                  index number of items
p, q                  index number of rank positions
$f_{ui}$              prediction score of item i for u
$f_u^{(p)}$           prediction score of the top p-th item for u
$\mathbf{f}_{u*}$     prediction scores for u
$\mathbf{f}_{*i}$     prediction scores of item i for all users
$y_{ui}$              relevance of item i for u
$y_u^{(p)}$           relevance of the top p-th item for u
$b_i$                 the bias of item i

3. THE BIAS LEARNING PROBLEM

As mentioned in the introduction, recommendation in eCommerce
suffers, on one hand, from data sparsity; hence it is imperative to
train the recommendation model with as long a transaction history
as possible. On the other hand, the temporal dynamics of consumer
purchases lead us to weigh recently trending items more heavily.
Such fluctuation can be attributed to all kinds of factors. For
example, the retailer itself may occasionally post products with
huge discounts for promotion. External events like a nationwide
snow storm are very likely to lead to booming purchases of heaters.
While capturing all varieties of external factors can be one way to
demystify the temporal dynamics, it is challenging to encompass them
all. Alternatively, we take a data-based approach and learn a bias
for each item. In particular, we collect transactions from the
recent few days or weeks, and attempt to determine item-specific
biases such that a certain evaluation measure is maximized. The
problem can be formally stated as follows:

Given:

• a set of users $\mathcal{U}$;

• an evaluation criterion perf;

• recent relevance information Y of items for users in
  $\mathcal{U}$, with $y_{ui}$ denoting the relevance of item i for
  user u;

• prediction scores F for the set of users $\mathcal{U}$, with
  $f_{ui}$ being the score of item i for u;

Find: a vector of biases $\mathbf{b} = (b_1, b_2, \cdots, b_m)^T$,
with $b_i$ indicating the bias of item i, so that

• the top-k recommendations are selected as
  $\pi_k(\mathbf{f}_{u*} + \mathbf{b})$ according to Eq. (2);

• the corresponding performance

  $$perf(\mathbf{f}_{1*} + \mathbf{b}, \mathbf{f}_{2*} + \mathbf{b}, \cdots, \mathbf{f}_{n*} + \mathbf{b}) \tag{8}$$

  is maximized.

Because the biases are learned with recent transactions only, they
capture the recent trend change. The resultant biases can differ
depending on the evaluation criterion. Note that none of the three
measures ACC@k, AP@k and NDCG@k is smooth with respect to the
prediction scores, due to the top-k selection and the dichotomy of
relevance, making the problem above intractable to solve
analytically. However, we shall show that the problem can be solved
iteratively by a procedure that is guaranteed to terminate in a
finite number of steps and to converge to a coordinate-wise (local)
optimum.
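
To make the objective in Eq. (8) concrete, the sketch below evaluates it for a candidate bias vector, using NDCG@k as the criterion. The names `ndcg_at_k` and `perf_with_bias`, the use of the natural logarithm (the base cancels in the NDCG ratio), and the tie-breaking by item index are assumptions of this illustration, not part of the method itself.

```python
import numpy as np

def ndcg_at_k(ranked_rel, k, n_relevant):
    """NDCG@k for one user from the 0/1 relevance of the ranked items."""
    dcg = sum(rel / np.log(1 + p) for p, rel in enumerate(ranked_rel[:k], start=1))
    ideal = sum(1.0 / np.log(1 + p) for p in range(1, min(k, n_relevant) + 1))
    return dcg / ideal if ideal else 0.0

def perf_with_bias(F, Y, b, k, metric=ndcg_at_k):
    """Mean per-user performance after adding the item biases b to the scores."""
    F = np.asarray(F, dtype=float) + np.asarray(b, dtype=float)  # same b for every user
    Y = np.asarray(Y)
    total = 0.0
    for f_u, y_u in zip(F, Y):
        top = np.argsort(-f_u, kind="stable")[:k]    # top-k selection after biasing
        total += metric(list(y_u[top]), k, int(y_u.sum()))
    return total / len(F)

# One user, three items; only item 1 was purchased recently.
F = [[0.9, 0.1, 0.5]]
Y = [[0, 1, 0]]
before = perf_with_bias(F, Y, b=[0.0, 0.0, 0.0], k=1)
after = perf_with_bias(F, Y, b=[0.0, 1.0, 0.0], k=1)
```

Any bias for item 1 between 0 and 0.8 leaves the objective unchanged in this toy case, and the value jumps once the item overtakes the current top item, which illustrates why the objective is piecewise constant rather than smooth.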

4. ALGORITHM 

Since bias learning aims to find a bias for each item, we can
rewrite the objective in Eq. (8) with respect to items:

$$\max_{b_1, b_2, \cdots, b_m} perf(\mathbf{f}_{*1} + b_1, \mathbf{f}_{*2} + b_2, \cdots, \mathbf{f}_{*m} + b_m). \tag{9}$$
















The problem is difficult to solve because of the ranking hidden in
the calculation of perf. Yet the problem is solvable if we optimize
one $b_i$ at a time. We propose to adopt an alternating optimization
approach. That is, we fix the scores and biases of all other items
while optimizing the bias of one particular item. We cycle through
all items until the objective function stabilizes. Next, we first
describe the case of finding the optimal bias for one single item.
We use ACC@k as an example to derive the algorithm and then
generalize it to handle other types of evaluation criteria like
MAP@k and NDCG@k.

4.1 Finding optimal bias for single item 

Keep in mind that the accuracy contribution of one item solely
depends on whether the item enters the top-k. We start with the
simplest case: assuming that for a given user u, item i is not in
the top-k recommendation, how will the objective in Eq. (9) change
if we increase the score $f_{ui}$ to push item i into the top-k?
Once item i is pushed into the top-k, the top k-th item in the
original recommendation is naturally discarded. This swap of items
has four cases when considering the relevance of each item, as shown
below:

$y_{ui}$    $y_u^{(k)}$    $\delta_{ui} = (y_{ui} - y_u^{(k)}) / k$
0           0              0
1           1              0
0           1              -1/k
1           0              1/k
                                                              (10)


When both items share the same relevance, that is, both are
relevant or both are irrelevant, the accuracy for user u does not
change according to Eq. (3). The accuracy changes only if the two
items are associated with different relevance information. In
particular, when the item discarded from the top-k is relevant
while item i is not, the accuracy decreases by 1/k. By contrast, if
item i is relevant but the discarded item is not, the accuracy
increases by 1/k. We can summarize the accuracy change as
$(y_{ui} - y_u^{(k)}) / k$ if we push item i into the top-k
recommendation.

In order to push item i into the top-k recommendation, the add-on
bias should be at least $f_u^{(k)} - f_{ui}$. Note that the bias
value can be negative. Once the item enters the top-k
recommendation, the accuracy does not change even if we increase the
bias further. We can record the add-on bias necessary for item i to
enter the top-k recommendation of each user, and compute the
corresponding accuracy change. Then we can determine the overall
accuracy change for all users at different bias values and pick the
optimal bias with the maximal utility increase. For presentation
convenience, we define utility as the accuracy change after item i
enters the top-k recommendation. The utility is 0 if the item does
not appear in the top-k recommendation of any user. This serves as
the reference point (origin) when comparing the utility values of
different candidate bias values.

Algorithm 1 presents the procedure to find the optimal bias for one
single item i with respect to ACC@k. Lines 2-11 compute all possible
bias values which may lead to a utility change. Note that lines 7-9
compute the current utility score when the bias is set to 0. We
prefer a 0 bias if no utility improvement is possible. Line 12
considers the extreme case in which item i does not enter the top-k
recommendation of any user, i.e., we set the bias to negative
infinity (say, a huge negative constant), and the corresponding
utility is zero. This is included for cases in which item i entering
the top-k recommendation leads to a negative utility. It is then
better to exclude item i from the top-k recommendation. Once we have
all the potential bias values that may result in a different utility
score, the subroutine in Algorithm 2 sorts the bias values in
ascending order and figures out the optimal bias, which leads to the
maximal utility. Then in lines 14-15 of Algorithm 1 we report the
optimal bias and the corresponding utility change compared with the
current utility.

Algorithm 1: Find optimal bias for item i

Input: item i, scores $\mathbf{f}_{*i}$, $\mathbf{f}_{*}^{(k)}$, and
       relevance $\mathbf{y}_{*i}$, $\mathbf{y}_{*}^{(k)}$;
Output: optimal bias $b_i$ and utility change $\Delta_i$;
 1  init candidate bias values $S = \emptyset$; cur_util = 0;
 2  for each u do
 3      if $y_{ui} \ne y_u^{(k)}$ then
 4          compute utility change $\delta_{ui} = (y_{ui} - y_u^{(k)}) / k$;
 5          compute score difference $s_{ui} = f_u^{(k)} - f_{ui}$;
 6          append $(s_{ui}, \delta_{ui})$ to S;
 7          if $s_{ui} < 0$ then
 8              cur_util = cur_util + $\delta_{ui}$;
 9          end
10      end
11  end
12  push $(-\infty, 0)$ to S;
13  find $b_i$ and max_util via the subroutine in Algorithm 2;
14  update $\Delta_i$ = max_util - cur_util;
15  return $b_i$, $\Delta_i$

Algorithm 2: Subroutine to find optimal bias given candidate
<bias, utility change> pairs

Input: cur_util, candidate pairs $S = \{(s, \delta)\}$;
Output: optimal bias b and max_util;
 1  S = sort S by score difference s;
 2  init b = 0, max_util = cur_util;
 3  init util = 0;
 4  for each $(s, \delta)$ in S do
 5      util = util + $\delta$;
 6      if util > max_util then
 7          b = s;
 8          max_util = util;
 9      end
10  end

The time complexity of this procedure is O(n log n), where n is the
number of users. Lines 2-11 in Algorithm 1 cost linear time.
Algorithm 2 costs O(n log n) because of the sorting in line 1.
Therefore, the total computational time to find the optimal bias
for one single item is O(n log n).
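
The single-item search of Algorithms 1 and 2 can be sketched in Python as follows. This is an interpretation under stated assumptions rather than a definitive implementation: the function name is hypothetical, the "k-th item" is taken to be the k-th best item among all items other than i (so that the same rival is used whether or not item i is currently in the top-k), item i is assumed to win ties (so it counts as inside the top-k when the score difference is 0), candidates with equal score differences are evaluated together, and utility is accumulated in integer units of 1/k to avoid rounding effects. For brevity the rival is found by sorting each user's scores, whereas the complexity analysis above presumes the k-th scores are already available.

```python
import numpy as np

def find_optimal_bias_acc(F, Y, i, k):
    """Best add-on bias for item i under ACC@k, all other items held fixed.

    F: (n_users, n_items) scores; Y: same-shape 0/1 relevance; needs n_items > k.
    Returns (bias, gain): gain is the summed per-user ACC@k change versus bias 0.
    """
    F = np.asarray(F, dtype=float)
    Y = np.asarray(Y, dtype=int)
    others = np.delete(np.arange(F.shape[1]), i)
    pairs = []      # (score difference s_ui, relevance difference in {-1, +1})
    cur_util = 0    # utility at bias 0, in units of 1/k
    for u in range(F.shape[0]):
        # Rival: the item displaced from the top-k when item i enters it.
        rival = others[np.argsort(-F[u, others], kind="stable")[k - 1]]
        delta = Y[u, i] - Y[u, rival]
        if delta != 0:
            s = F[u, rival] - F[u, i]
            pairs.append((s, delta))
            if s <= 0:                      # item i already in the top-k at bias 0
                cur_util += delta
    pairs.sort(key=lambda pair: pair[0])
    best_bias, best_util = 0.0, cur_util    # keep bias 0 unless strictly beaten
    if best_util < 0:                       # bias -inf: item i is never recommended
        best_bias, best_util = -np.inf, 0
    util = 0
    for idx, (s, delta) in enumerate(pairs):
        util += delta
        last_of_tie = idx + 1 == len(pairs) or pairs[idx + 1][0] > s
        if last_of_tie and util > best_util:
            best_bias, best_util = s, util
    return best_bias, (best_util - cur_util) / k

# Two users, three items, top-1 recommendation; only item 2 is relevant.
F = np.array([[3.0, 2.0, 1.0], [3.0, 1.0, 2.0]])
Y = np.array([[0, 0, 1], [0, 0, 1]])
bias, gain = find_optimal_bias_acc(F, Y, i=2, k=1)
```

In this toy case a bias of 1 would lift item 2 to the top for the second user only, while a bias of 2 reaches both users, so the larger value is returned.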

4.2 Finding optimal biases for all items 

In the previous subsection, we presented an algorithm to learn the
optimal bias of one single item while fixing the scores of all
other items. In order to find optimal biases for all items, we
propose to cycle through all items to update biases, tantamount to
the well-known coordinate descent method in optimization. The
detailed algorithm is shown in Algorithm 3.

The algorithm consists of two loops: the inner loop cycles through
all items to find the optimal bias of each individually. The outer
loop stops if no utility increase can be found. The updates in lines
8-12 after changing the bias of one item are straightforward except
for line 12. In order to save computational cost, we can record the
top-k recommendation given the current scores, and the minimum score
in the top-k recommendation of each user. Based on the minimum
score, we can immediately decide whether item i is in the top-k
recommendation before and after its bias is updated. If item i is
not in the top-k recommendation either before or after the bias
update, then nothing needs to be done. We can take care of the other
cases in a similar vein so that we do not have to recompute the
top-k recommendation for each user, which speeds up the computation
significantly.

Algorithm 3: Learning Biases for All Items

Input: score predictions $\{\mathbf{f}_{1*}, \mathbf{f}_{2*}, \cdots, \mathbf{f}_{n*}\}$,
       relevance information $\{\mathbf{y}_{1*}, \mathbf{y}_{2*}, \cdots, \mathbf{y}_{n*}\}$;
Output: optimal biases $\mathbf{b} = (b_1, b_2, \cdots, b_m)^T$
 1  construct candidate item set C to learn optimal bias;
 2  init $b_i = 0$ for all $i \in C$;
 3  init $\Delta_{utility} = 1$;
 4  while $\Delta_{utility} > 0$ do
 5      reset $\Delta_{utility} = 0$;
 6      for each item i in C do
 7          compute potential utility change $\Delta_i$ and
            corresponding add-on bias $b'_i$ following Algorithm 1;
 8          if $\Delta_i > 0$ then
 9              update $\Delta_{utility} = \Delta_{utility} + \Delta_i$;
10              update $b_i = b_i + b'_i$;
11              update prediction scores $\mathbf{f}_{*i}$ for item i;
12              update top k-th recommendation for each user if
                necessary;
13          end
14      end
15  end

Based on the algorithm above, we can derive theoretical 
properties below: 

Theorem 4.1. Algorithm 3 is guaranteed to terminate in finite
steps.

Proof. The accumulated utility is upper-bounded by the maximum
accuracy of top-k recommendation among all possible rankings. Each
cycle (lines 4-15) in the algorithm increases the utility by at
least 1/k if not zero. Therefore, the algorithm must terminate in a
finite number of iterations. □

The number of cycles tends to be very small in reality. In most
cases, we just need to cycle through all items a couple of times.
Nevertheless, we show that based on the algorithm, we are able to
filter out certain items so that each iteration scans a smaller
number of items, as stated below:

Theorem 4.2. If one item i satisfies $y_{ui} = 0$ for all u, then
the item i can be removed from consideration of recommendation.

Proof. Let us revisit the utility change table shown in Eq. (10).
Note that when $y_{ui}$ is 0, swapping a different item into the
top-k recommendation in its place will always increase or keep the
current utility. Because item i satisfies $y_{ui} = 0$ for all u, it
follows that setting a bias of negative infinity for item i leads to
no utility loss overall (if not a positive utility change), no
matter whether item i is currently in the top-k recommendation. This
is essentially equivalent to removing item i from consideration of
recommendation. □

The theorem above suggests that we may remove items without positive
relevance from the input score predictions in Algorithm 3. In the
context of eCommerce, those products which do not appear in recent
transactions can be removed from recommendation directly, reducing
the computational cost substantially.

The proposed algorithm is sequential in nature, and thus difficult
to parallelize. The time complexity of the proposed algorithm is at
least O(mn log n), where m is the size of the candidate item set C
and n is the number of users. Even with few cycles, the time and
space costs become prohibitive when the numbers of both users and
items are huge. Next we discuss a couple of heuristics to consider
in practice. All these methods have been adopted in our experiments
and have been shown to save tremendous time and space.

4.3 Scaling up for Practical Implementation

First of all, it is not necessary to store the prediction scores
for all items. Consider recommendation for even just 100K users with
100K items, which is medium size in the era of big data. Assuming
each score takes 4 bytes as a float number, this requires
100K × 100K × 4 = 40G bytes of memory. Therefore, we suggest
exploiting a sparse representation of prediction scores by keeping
only a select number of top recommendations while defaulting the
others to zero. The number of recommendations to keep is typically a
multiple of k. For instance, when we need to optimize for
performance at top-10, we can keep the top-50 recommendations from a
model. For the remaining items, the prediction scores are set to 0.
This is valid because we often observe a fast decay of the scores no
matter what the base recommendation model is.

Second, we can shrink the candidate item set for bias tuning to
reduce time complexity. Two types of items should be considered with
higher priority for bias tuning. One set consists of those items
which are recommended frequently based on raw prediction scores,
which tend to be past popular items. The other set consists of those
appearing frequently in recent transactions, which are the recently
popular ones. The former is likely to lead to a negative bias, while
the latter is likely to lead to a positive bias. For other items
outside the candidate set, we set their bias value to zero without
changing their prediction scores.

Moreover, we notice that the number of items with updated bias
scores decreases dramatically with each cycle through the items. The
algorithm reaches a coordinate-wise local optimum and stops after a
few cycles. Yet scanning all items is expensive, and thus early
stopping criteria can be used. For example, we may set a percentage
threshold such that if fewer than that percentage of items need a
bias update, we terminate the cycling. Or we can simply set an upper
bound on the number of cycles. In our experiments, we just set it to
2 to reduce computational time, yet found no performance loss.
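
The pruning suggested by Theorem 4.2 and the sparse score representation of Section 4.3 can be sketched as below. The function names and the choice of a per-user dictionary as the sparse structure are assumptions made for this illustration; any sparse matrix format would serve the same purpose.

```python
import numpy as np

def prune_irrelevant_items(F, Y):
    """Drop items with no positive recent relevance (Theorem 4.2).

    Returns the kept item indices and the reduced score and relevance matrices.
    """
    F, Y = np.asarray(F, dtype=float), np.asarray(Y)
    keep = np.flatnonzero(Y.sum(axis=0) > 0)
    return keep, F[:, keep], Y[:, keep]

def keep_top_scores(F, n_keep):
    """Keep only the n_keep highest scores per user; all others default to 0.

    Returns one {item index: score} dict per user (n_keep must not exceed
    the number of items).
    """
    F = np.asarray(F, dtype=float)
    top = np.argpartition(-F, n_keep - 1, axis=1)[:, :n_keep]
    return [{int(i): float(F[u, i]) for i in top[u]} for u in range(F.shape[0])]

# Memory needed for a dense 100K x 100K matrix of 4-byte floats.
dense_bytes = 100_000 * 100_000 * 4          # 40 GB

keep, F_small, Y_small = prune_irrelevant_items(
    [[0.2, 0.7, 0.1], [0.4, 0.3, 0.9]],
    [[1, 0, 0], [0, 0, 1]],
)
sparse_scores = keep_top_scores([[0.1, 0.9, 0.5, 0.3]], n_keep=2)
```

With top-50 scores kept for each of 100K users, the same data would hold 5 million entries instead of 10 billion, at the price of treating every unlisted score as 0.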


4.4 Extensions to optimize MAP or NDCG 

We have described how to optimize biases with respect to ACC@k.
Now we extend the algorithm to optimize MAP@k or NDCG@k. Different
from accuracy, for which the position of an item in the top-k
recommendation does not matter, MAP and NDCG are rank-related
metrics. The position of an item in the top-k recommendation plays
an important role. A relevant item ranked as top-1 results in a
different score than when the item is ranked as the top k-th. In
order to compute the potential utility change with different bias
scores, we have to consider all the possible positions rather than
just the top k-th item. Nevertheless, the basic idea remains the
same. We compute the score difference and the corresponding utility
change with respect to each position in the top-k.

Take average precision in Eq. (4) as an example. Given a user,
assume the average precision of the top-k recommendation without
item i is $AP_0$, serving as our reference point. We can gradually
increase the score of item i so that it is ranked at position
k, k - 1, ..., 1, and compute the corresponding performance. For
each position, its utility change should be computed as follows:

Position    Performance    Utility change
k           $AP_k$         $\delta_k = AP_k - AP_0$
k - 1       $AP_{k-1}$     $\delta_{k-1} = AP_{k-1} - AP_k$
...         ...            ...
1           $AP_1$         $\delta_1 = AP_1 - AP_2$


Suppose item i is currently at position p + 1, and we increase its
bias to move it to position p. Note that only the items at
positions p and p + 1 are swapped, while the other items remain
unchanged. Therefore, we only need to examine the utility change
caused by the swap of the two items. If item i and the item at
position p have the same relevance, i.e., $y_{ui} = y_u^{(p)}$, the
AP does not change, leading to no utility change. If $y_{ui} = 1$
and $y_u^{(p)} = 0$, pushing item i into position p leads to a
utility increase; if $y_{ui} = 0$ and $y_u^{(p)} = 1$, it leads to a
utility reduction. Let
$\tilde{k} = \min\{k, \#\text{ relevant items for } u\}$; the
utility change can be derived as below:




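$$\delta_{ui}^{(p)} = \left(y_{ui} - y_u^{(p)}\right) \left(1 + \sum_{q=1}^{p-1} y_u^{(q)}\right) \left(\frac{1}{p} - \frac{1}{p+1}\right) \Big/ \tilde{k} \tag{13}$$

with $\frac{1}{p+1} = 0$ if $p + 1 > k$.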

In Eq. (13), the first term (with 1/p) is the utility contributed at
position p once item i has moved there from position p + 1, and the
second term (with 1/(p + 1)) is the utility contributed at position
p + 1, where the original item at position p ends up after being
downgraded. There is one special case when p + 1 > k, i.e., item i
has not entered the top-k yet at the beginning. In that case, we
replace 1/(p + 1) by 0. Similarly for NDCG, it follows that


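$$\delta_{ui}^{(p)} = \left(y_{ui} - y_u^{(p)}\right) \left(\frac{1}{\log(1 + p)} - \frac{1}{\log(1 + (p + 1))}\right) \Big/ Z_{ku} \tag{14}$$

with $\frac{1}{\log(1 + (p + 1))} = 0$ if $p + 1 > k$.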
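
As a sanity check on Eqs. (13) and (14), the sketch below computes the swap gains in closed form so that they can be compared against a direct recomputation of AP@k or NDCG@k before and after the swap. The function names are hypothetical, `ranked_rel` holds the 0/1 relevance of the current positions 1 to p (1-based in the text, 0-based in the list), and the natural logarithm is assumed.

```python
import math

def ap_swap_gain(ranked_rel, y_i, p, k, n_relevant):
    """Eq. (13): AP@k change when item i moves from position p + 1 to p."""
    y_p = ranked_rel[p - 1]                   # relevance of the item now at position p
    hits_above = sum(ranked_rel[:p - 1])      # relevant items ranked above position p
    inv_next = 1.0 / (p + 1) if p + 1 <= k else 0.0
    return (y_i - y_p) * (1 + hits_above) * (1.0 / p - inv_next) / min(k, n_relevant)

def ndcg_swap_gain(ranked_rel, y_i, p, k, n_relevant):
    """Eq. (14): NDCG@k change when item i moves from position p + 1 to p."""
    y_p = ranked_rel[p - 1]
    inv_next = 1.0 / math.log(1 + (p + 1)) if p + 1 <= k else 0.0
    z = sum(1.0 / math.log(1 + q) for q in range(1, min(k, n_relevant) + 1))
    return (y_i - y_p) * (1.0 / math.log(1 + p) - inv_next) / z

# Current top-3 relevance is [1, 0, 0]; item i is relevant and outside the top-3.
# The user has 2 relevant items in total (the item at rank 1 and item i).
enter_gain = ap_swap_gain([1, 0, 0], y_i=1, p=3, k=3, n_relevant=2)   # into rank 3
climb_gain = ap_swap_gain([1, 0, 1], y_i=1, p=2, k=3, n_relevant=2)   # rank 3 to 2
```

Entering at rank 3 raises AP@3 from 1/2 to 5/6, and climbing to rank 2 raises it further to 1, matching gains of 1/3 and 1/6 from the closed form.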


Note that when $y_{ui}$ and $y_u^{(p)}$ are the same, the utility
changes following Eqs. (13) and (14) are both zero. Hence, rather
than checking every possible position in the top-k, we just need to
check those positions whose relevance differs from that of item i.
The algorithm to find the optimal bias for one single item is
summarized in Algorithm 4. It is almost the same as Algorithm 1
except for lines 3-5. Lines 3-4 check each potential position in the
top-k, and line 5 updates the utility change based on the
corresponding performance metric. Such a change leads to an increase
in time complexity, reaching O(kn log kn) to find the optimal bias
for a single item. This is still fine if k is reasonably small,
which is mostly true in practice. Plugging Algorithm 4 into
Algorithm 3, we obtain the bias values for all items with respect to
the selected evaluation criterion.

Algorithm 4: Find optimal bias wrt. MAP/NDCG

Input: item i, scores $\mathbf{f}_{*i}$, $\mathbf{f}_{*}^{(p)}$, and
       relevance $\mathbf{y}_{*i}$, $\mathbf{y}_{*}^{(p)}$;
       rank-related performance metric
Output: optimal bias $b_i$ and utility change $\Delta_i$;
 1  initialize candidate set $S = \emptyset$, cur_util = 0;
 2  for each u do
 3      for each position p do
 4          if $y_{ui} \ne y_u^{(p)}$ then
 5              compute utility change $\delta_{ui}$ wrt. performance
                metric following Eq. (13) or (14);
 6              compute score difference $s_{ui} = f_u^{(p)} - f_{ui}$;
 7              append $(s_{ui}, \delta_{ui})$ to S;
 8              if $s_{ui} < 0$ then
 9                  cur_util = cur_util + $\delta_{ui}$;
10              end
11          end
12      end
13  end
14  push $(-\infty, 0)$ to S;
15  find $b_i$ and max_util via the subroutine in Algorithm 2;
16  update $\Delta_i$ = max_util - cur_util;
17  return $b_i$, $\Delta_i$


5. EXPERIMENT SETUP 

In this section, we mainly describe the basic setup for our 
experiments, including preparation of benchmark data sets, 
base recommendation model and other methods considering 
temporal dynamics for comparison. 


5.1 Benchmark Data Sets 

We collect customer transactions of XYZ and construct benchmark
data sets via a split based on date. User activities before the
date are used for training, and the transactions in the subsequent
week are used to evaluate recommendation performance. The
corresponding performance measures include ACC@k, MAP@k, and NDCG@k
as described in Section 2.2. In our experiments, k is set to 10. For
easy interpretation, we report all numbers in terms of lift
(relative improvement) with respect to one baseline:

$$lift = \left(\frac{perf}{perf_{baseline}} - 1\right) \cdot 100\%.$$
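
The lift computation is simple arithmetic; a one-function sketch is given below, with hypothetical performance values used purely to show the calculation (they are not results from the experiments).

```python
def lift(perf, perf_baseline):
    """Relative improvement over the baseline, in percent."""
    return (perf / perf_baseline - 1.0) * 100.0

# Hypothetical values: a method scoring 0.12 against a baseline scoring 0.10.
example_lift = lift(0.12, 0.10)   # about 20 (percent)
```

A negative lift indicates that the method performs worse than the baseline.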

We prepare two benchmark data sets: one from the regular season
(Nov. 1, 2013) and the other from the holiday season (Dec. 11,
2013). User shopping behavior in the holiday season tends to be
quite different from that in the regular season, both in terms of
quantity and trending products. We aim to verify the efficacy of our
proposed method under both settings.


5.2 Base Recommendation Model 

One commonly used approach for recommendation in eCommerce is to
model user actions as a Markov chain [^]. It is tantamount to
computing item similarity [^], but keeping the metric directional.
That is, one's current action depends only on one's most recent
action. The transition probability from one action to another can
be estimated as below:

$$P(\text{buy } i \mid \text{bought } j) = \frac{\#\text{ users who bought } j \text{ then } i}{\#\text{ users who bought } j} \tag{15}$$


We also take into consideration highly associated purchase patterns
by computing the co-purchase probability of more than 2 items. For
instance, the transition from two actions to one purchase can be
computed as:

$$P(\text{buy } i \mid \text{bought } j_1, j_2) = \frac{\#\text{ users who bought } j_1, j_2 \text{ then } i}{\#\text{ users who bought } j_1, j_2} \tag{16}$$

However, considering such higher-order purchase patterns
leads to an explosion of the state space. Therefore, we consider
only states containing up to 2 actions. For prediction,
we pick the products with the highest probability
given the user's most recent few transactions.

This Markov model is exploited because it has been val¬ 
idated to work quite well in eCommerce. Later in experi¬ 
ments, we shall show that our proposed bias learning can be 
applied to other base recommendation models as well. 
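
The counting in Eqs. (15) and (16) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production implementation: "bought j then i" is read as i immediately following j in a user's history, each user is counted at most once per pattern, and the way first- and second-order scores are merged at prediction time (taking the maximum) is a hypothetical choice. The names `fit_transitions` and `recommend` are introduced here for illustration.

```python
from collections import defaultdict


def fit_transitions(histories):
    """Estimate P(next item | state) for states of 1 or 2 recent purchases.

    histories: dict mapping user -> list of purchased items in time order.
    Returns: dict mapping state (tuple of items) -> {next_item: probability}.
    """
    state_users = defaultdict(set)                      # denominator of Eqs. (15)/(16)
    next_users = defaultdict(lambda: defaultdict(set))  # numerator of Eqs. (15)/(16)
    for user, items in histories.items():
        for t in range(len(items)):
            for order in (1, 2):
                start = t - order + 1
                if start < 0:
                    continue
                state = tuple(items[start:t + 1])
                state_users[state].add(user)
                if t + 1 < len(items):
                    next_users[state][items[t + 1]].add(user)
    return {
        state: {item: len(users) / len(state_users[state])
                for item, users in followers.items()}
        for state, followers in next_users.items()
    }


def recommend(probs, recent, k=10):
    """Rank items by transition probability given the most recent purchases."""
    scores = {}
    for order in (2, 1):
        if len(recent) < order:
            continue
        state = tuple(recent[-order:])
        for item, p in probs.get(state, {}).items():
            # Assumption: keep the larger of the first- and second-order scores.
            scores[item] = max(scores.get(item, 0.0), p)
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]


# Toy histories, for illustration only.
histories = {"u1": ["a", "b", "c"], "u2": ["a", "b", "d"], "u3": ["a", "c"]}
probs = fit_transitions(histories)
print(recommend(probs, ["a"], k=2))
```

Restricting states to at most two actions keeps the number of states manageable, which is the trade-off described above.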


5.3 Methods Considering Temporal Change 

For our proposed bias learning method (denoted as M_bias),
we use the most recent 3 days of transactions to fine-tune the biases
of items. All the biases, unless otherwise specified, are learned by
optimizing ACC@k. For comparison, we also include one
baseline method that does not consider temporal change.


M_long: This method utilizes as long a history as possible for
training. As shown in an earlier figure, the longer the time
window we use, the better the recommendation model
performs. In our experiments, we use up to 15 months
of user activity history for training.


Besides the baseline, there are several other approaches to 
take into account temporal dynamics. 

M_truncate: According to Theorem 4.2, if an item does not
appear in recent transactions, we can remove it from
recommendation. This method trains the recommendation
model using 15-month data, but for prediction
it concentrates only on those items that have recently
attracted user attention. It truncates the item set
for recommendation but does not tune the biases of items.


M_distrdiff: This method directly computes a bias term for
each item, rather than optimizing biases with respect
to a certain metric. In particular, we compute two
distributions of items, one from the 15-month transactions
(denoted as $p^{(\mathrm{long})}$) and the other from recent
transactions (denoted as $p^{(\mathrm{short})}$). The item-specific
bias is then $b = p^{(\mathrm{short})} - p^{(\mathrm{long})}$. Biases are added to
the normalized probabilistic output of the recommendation
engine to promote recently trending products while
suppressing formerly popular ones.


M_decay: Another option is to train a model that already
incorporates temporal dynamics, by assigning lower
weights to more remote events. An exponential decay
function is used: $w = \exp(-\Delta t / \beta)$, where $\Delta t$ is the
time gap between the purchase in Eqs. (15) and (16) and the
current date of recommendation, and $\beta$ is a decay factor.
$\beta$ is set to 60 in our experiments. This method is listed
as M_weighted in Tables 2 and 3.
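
The two heuristic alternatives above are simple enough to sketch directly. The Python below is a hedged illustration: the helper names are hypothetical, the decay form exp(-Δt/β) and the use of days as the time unit are assumptions, and the toy purchase lists are not real data.

```python
import math
from collections import Counter


def item_distribution(purchases):
    """Empirical distribution of items in a list of purchased item ids."""
    counts = Counter(purchases)
    total = sum(counts.values())
    return {item: c / total for item, c in counts.items()}


def distr_diff_bias(long_purchases, short_purchases):
    """M_distrdiff-style bias: recent item share minus long-term item share."""
    p_long = item_distribution(long_purchases)
    p_short = item_distribution(short_purchases)
    items = set(p_long) | set(p_short)
    return {i: p_short.get(i, 0.0) - p_long.get(i, 0.0) for i in items}


def decay_weight(delta_t, beta=60.0):
    """M_decay-style event weight; delta_t and beta share one unit (days assumed)."""
    return math.exp(-delta_t / beta)


# Toy data, for illustration only.
bias = distr_diff_bias(long_purchases=["a", "a", "b", "c"], short_purchases=["b", "b"])
print(bias["a"], bias["b"], bias["c"])
print(decay_weight(0), decay_weight(60))
```

Items that gained share recently receive a positive bias, and items that lost share receive a negative one; no evaluation metric is optimized in either heuristic.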


Table 2: Lift (%) at Regular-Season Data

Method         ACC@10     MAP@10     NDCG@10
M_long           0          0          0
M_bias           1.228      0.842      0.972
M_truncate       0.460      0.142      0.239
M_distrdiff      0.350      0.384      0.389
M_weighted     -10.703    -11.083    -10.495

Table 3: Lift (%) at Holiday-Season Data

Method         ACC@10     MAP@10     NDCG@10
M_long           0          0          0
M_bias           5.857      5.391      5.482
M_truncate       1.058      0.806      0.860
M_distrdiff      0.921      0.593      0.705
M_weighted      -4.737     -4.302     -4.174


Note that there has been some work on temporal dynamics
for recommendation. For example, Koren et al. proposed
a time-dependent bias in matrix factorization for
movie/music ratings. The model minimizes
the root mean squared error and adopts stochastic gradient
descent to find biases and latent factors. However, in
our application, the responses are binary (either purchase
or no purchase), and only positive responses are collected.
A trivial solution would set the bias to 1 for all items, which is
meaningless. We may randomly sample negative entries for our
one-class collaborative filtering problem, but that would
essentially tie the bias to the sampling rate, which is not
acceptable either.

6. EXPERIMENTS 

In this section, we conduct a series of experiments over
the constructed benchmark data sets to study the performance
of bias learning and its sensitivity to base recommendation
models and performance metrics. At the end, we report results
from applying our method in online A/B tests.

6.1 Performance Comparison 

The performance of the various methods is shown in Tables 2
and 3. For easy comparison, we treat M_long as the baseline
and show the lift of the other methods. The numbers in bold face
indicate the best performance. For both the regular-season
and holiday-season data sets, our proposed method
M_bias is the winner. The lift is more pronounced in the holiday
season because of the strong change in shopping patterns driven
by Black Friday, Cyber Monday and other promotion campaigns.
The numbers might look small, but keep in mind that
it took nearly 3 years and thousands of teams worldwide
to improve 10% over a trivial baseline in the Netflix prize
competition (see footnote). M_truncate and M_distrdiff both
yield some improvement, suggesting that it helps
to incorporate recent trends. However, neither of them is
comparable to learning a bias for each item as we propose.
As shown in both tables, adding a weighted decay based on
recency for training does not help; it actually lowers performance.
In short, learning biases to capture the temporal dynamics can
model individual interests and preferences more accurately, and thus
improves the performance of the base recommendation model. This is
especially helpful when the temporal fluctuation is large, as
shown during the holiday season.

^http://en.Wikipedia.org/wiki/Netflix_prize 
Figure 3: Lift at Regular Season
Figure 4: Lift at Holiday Season


Table 4: Coverage of Top Popular Items

top popular items    without bias    with bias
      10                   3              7
      20                   8             15
      50                  24             28
     100                  48             57
     200                  97            116
     500                 260            300
    1000                 531            576


At the macro level, we observe that bias learning tends to
shift the item distribution of our top-k recommendations towards
the ground truth. We sort all items by their
frequency in descending order, in the test data and in the
predictions respectively, and then check the overlap of the top
popular items between the two. The result over the holiday season
data is shown in Table 4. A similar trend is observed during the
regular season. Adding the bias clearly helps the predictions
capture recently trending products. For example,
only 3 out of the top-10 popular products in the testing period
are covered by the raw recommendation, whereas adding a
bias increases the coverage to 7. This pattern is
consistent for a wide range of values, as shown in the table.
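
The coverage check described above can be sketched as follows. This is a minimal illustration with a hypothetical function name and toy inputs; in particular, how ties in popularity are broken is an assumption (here, first occurrence wins).

```python
from collections import Counter


def top_n_overlap(test_items, predicted_items, n):
    """Number of items shared by the top-n most frequent items of both lists."""
    top_test = {item for item, _ in Counter(test_items).most_common(n)}
    top_pred = {item for item, _ in Counter(predicted_items).most_common(n)}
    return len(top_test & top_pred)


# Toy data, for illustration only.
test_items = ["a"] * 3 + ["b"] * 2 + ["c"]
predicted_items = ["b"] * 3 + ["d"] * 2 + ["a"]
print(top_n_overlap(test_items, predicted_items, n=2))
print(top_n_overlap(test_items, predicted_items, n=3))
```

Each row of Table 4 corresponds to one call of this kind, with n set to the value in the first column.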

We also compute the KL-divergence between transactions
in the testing period and predictions. As some items appear
only in transactions or only in predictions, we smooth the
distributions by treating every zero frequency as 0.01 to avoid an
unbounded divergence value. For the regular-season data,
the KL-divergence between predictions and true transactions
is 0.9413, and it decreases to 0.9406 once biases are
added. Similarly, for the holiday-season data, the divergence
decreases from 0.7263 to 0.6757 owing to the learned biases.
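
A minimal sketch of the smoothed divergence computation is given below. Two points are assumptions made for illustration: the divergence is taken from the test distribution to the prediction distribution, and the pseudo-count of 0.01 is applied to raw frequencies before normalization. The function name is hypothetical.

```python
import math
from collections import Counter


def smoothed_kl(test_items, predicted_items, pseudo_count=0.01):
    """KL(test || predictions) over item frequencies, with zero counts smoothed."""
    c_test, c_pred = Counter(test_items), Counter(predicted_items)
    items = set(c_test) | set(c_pred)

    def distribution(counts):
        # Replace every zero frequency by the pseudo-count, then normalize.
        raw = {i: counts[i] if counts[i] > 0 else pseudo_count for i in items}
        total = sum(raw.values())
        return {i: v / total for i, v in raw.items()}

    p, q = distribution(c_test), distribution(c_pred)
    return sum(p[i] * math.log(p[i] / q[i]) for i in items)


# Toy data, for illustration only.
print(smoothed_kl(["a", "a", "b"], ["a", "a", "b"]))
print(smoothed_kl(["a"], ["b"]))
```

Without the smoothing step, the second call would involve a division by zero, which is the unbounded case mentioned above.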

In summary, our proposed method is able to learn biases
that incorporate the temporal dynamics of items, both at
the individual level and at the macro level, and hence improves
recommendation performance. This improvement is more
pronounced when there is a big difference between the data
distributions at training and at recommendation time.

6.2 Optimizing Different Performance Metrics 

Here we apply the proposed bias learning algorithm with
respect to rank-related metrics like MAP or NDCG. Figures 3
and 4 plot the lift of M_bias over the baseline M_long on
both data sets. First of all, all methods lead to a positive
lift, implying the effectiveness of learning biases. Nevertheless,
the optimization criterion makes a difference in final
performance. Initially, we conjectured that optimizing one metric
would lead to higher numbers in terms of that particular
metric, e.g., optimizing MAP should result in higher MAP
in test performance. In reality, this is not the case. As seen
in the lift plots, optimizing MAP actually results in lower
performance in terms of all three metrics, whereas optimizing NDCG
yields performance comparable to optimizing ACC.

Figure 5: Bias Values

We also compare the biases learned by optimizing
ACC with those learned by optimizing NDCG. We learned 386 non-zero
biases for ACC and 402 non-zero biases for NDCG, of which 384 are
shared between both. The corresponding bias values are
shown in Figure 5, with items sorted by the bias values
obtained when optimizing ACC. The majority of them have
close-to-zero values. Though some values are positive, more
bias values are negative, suggesting that many item biases
try to discount the popularity observed in the training data.
The scale of the negative values also tends to be much larger.
Moreover, optimizing NDCG gives less extreme values, because
it considers all the possible ranking positions in the top-k.

Recall that the time complexity of each iteration in
optimizing NDCG is k times larger than that of optimizing
ACC. Therefore, we have to strike a balance between
performance and computational time. In practice, we can
keep one trained recommendation model and update only
the biases daily. Faster convergence can be achieved
via warm start, i.e., using the biases learned on the previous
day for initialization. For the remaining experiments, we
report only the findings from optimizing ACC.

6.3 Bias Learning for Different Base Models 

In previous subsections, we have mainly studied the properties
of bias learning with the Markov model. Here we explore
other base recommendation models. Our proposed
bias learning is largely orthogonal to the base model being
used and is applicable to a wide array of models. Two
models are considered here: matrix factorization (MF) and
category-based recommendation (CBR).

Matrix factorization (MF) gained momentum thanks to
the Netflix prize competition [11, 21]. It has been shown to be
one of the state-of-the-art methods for collaborative filtering.
Standard matrix factorization aims to approximate a
user-item matrix as the product of two low-rank matrices:
$R_{n \times m} \approx P_{n \times \ell} Q_{\ell \times m}$, where P and Q are the latent factors of
users and items, respectively. Typical matrix factorization
(either through alternating least squares or stochastic gradient
descent) requires an iterative process, which involves too
much overhead when implemented in MapReduce in order
to deal with large-scale data sets. As an alternative, we
implemented a previously described randomized version of matrix
factorization. It utilizes a randomized SVD to compute an
approximate Q and then determines P given Q.
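
To make the "compute Q, then determine P given Q" idea concrete, here is a small, non-iterative sketch in pure Python. It uses a generic randomized range finder (random mixtures of the rows of R, followed by Gram-Schmidt orthonormalization) and is not claimed to be the exact randomized SVD variant used in our experiments; all function names are hypothetical, and a real deployment would rely on a numerical library and distributed computation.

```python
import random


def matmul(A, B):
    """Product of two dense matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def orthonormalize(vectors, tol=1e-12):
    """Modified Gram-Schmidt; returns an orthonormal list of row vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            dot = sum(x * y for x, y in zip(w, b))
            w = [x - dot * y for x, y in zip(w, b)]
        norm = sum(x * x for x in w) ** 0.5
        if norm > tol:
            basis.append([x / norm for x in w])
    return basis


def randomized_mf(R, rank, seed=0):
    """Approximate R (n x m) by P (n x rank) times Q (rank x m) in one pass."""
    rng = random.Random(seed)
    n = len(R)
    omega = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(rank)]
    sketch = matmul(omega, R)     # random mixtures of the rows of R
    Q = orthonormalize(sketch)    # orthonormal rows spanning the sketch
    P = matmul(R, transpose(Q))   # least-squares P for fixed orthonormal Q
    return P, Q


# Toy rank-2 user-item matrix, for illustration only.
R = matmul([[1, 0], [0, 1], [1, 1], [2, 1]], [[1, 2, 0], [0, 1, 3]])
P, Q = randomized_mf(R, rank=2)
approx = matmul(P, Q)
print(max(abs(a - r) for ra, rr in zip(approx, R) for a, r in zip(ra, rr)))
```

Because Q has orthonormal rows, P follows from a single matrix product, which is what removes the iterative loop of alternating least squares or stochastic gradient descent.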

Table 5: Lift (%) of Bias Learning over Base Models

base model         MF        CBR      Markov
regular-season   39.100     0.425     1.228
holiday-season   27.938     5.477     5.857

Category-based recommendation (CBR) assumes that a user is
more likely to purchase a product within a category
if he has already indicated an interest in that category:

$$P(\text{buy } i \mid \text{user } u) = \sum_{c} P(\text{buy } i \mid \text{buy in } c) \cdot P(\text{buy in } c \mid \text{user } u).$$

The interest categories of a user, $P(\text{buy in } c \mid \text{user } u)$, are
computed from his past actions. Each user is represented
as a multinomial distribution over interest categories,
by mapping each of his actions to its corresponding category
in a carefully curated product taxonomy. To estimate
$P(\text{buy } i \mid \text{buy in } c)$, we examine the popularity of each product
among existing transactions. In order to capture the recent
trend, we apply M_truncate; that is, we restrict ourselves
to transactions within the past few days/weeks
at recommendation time. Therefore, this category-based
recommendation tends to pick recent best-selling products
within one's personal interest categories.
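
The category-based score can be sketched as below. The sketch assumes that every item belongs to exactly one category and that both probabilities are plain relative frequencies; the function names, the `category_of` mapping and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def fit_cbr(user_history, recent_transactions, category_of):
    """Estimate P(buy in c | user u) and P(buy i | buy in c) by counting."""
    user_cat = {}
    for user, items in user_history.items():
        counts = Counter(category_of[i] for i in items)
        total = sum(counts.values())
        user_cat[user] = {c: n / total for c, n in counts.items()}

    # Item popularity within each category, from recent transactions only.
    per_cat = defaultdict(Counter)
    for item in recent_transactions:
        per_cat[category_of[item]][item] += 1
    item_given_cat = {
        c: {i: n / sum(counts.values()) for i, n in counts.items()}
        for c, counts in per_cat.items()
    }
    return user_cat, item_given_cat


def cbr_scores(user, user_cat, item_given_cat):
    """P(buy i | user u) = sum over c of P(buy i | buy in c) * P(buy in c | user u)."""
    scores = defaultdict(float)
    for c, p_c in user_cat.get(user, {}).items():
        for item, p_i in item_given_cat.get(c, {}).items():
            scores[item] += p_i * p_c
    return dict(scores)


# Toy taxonomy and transactions, for illustration only.
category_of = {"tv": "electronics", "phone": "electronics", "sock": "apparel"}
user_cat, item_given_cat = fit_cbr(
    user_history={"u": ["tv", "tv", "sock", "phone"]},
    recent_transactions=["phone", "phone", "tv", "sock"],
    category_of=category_of,
)
scores = cbr_scores("u", user_cat, item_given_cat)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Restricting `recent_transactions` to the past few days or weeks is what gives this model its preference for recent best sellers.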

We apply bias learning to both models. For brevity, we
report only the lift in terms of ACC@10 over the different base
models in Table 5. For all base models, we observe a positive
lift, suggesting that our bias learning is able to capture the
temporal dynamics no matter which model is being used.
The lift over MF is substantial, partly because MF itself
performs poorly. To our surprise, among all three methods,
matrix factorization performs the worst. Such poor
performance of matrix factorization has also been observed in other
domains with binary responses. One factor is that constructing
the matrix from transactions is critical yet not
well defined. It is difficult to incorporate both frequency
and recency information simultaneously into one matrix.

6.4 Online A/B Tests 

Here we run online A/B tests through email campaigns
to further examine the impact of the added bias on recommendation.
Each email contains 8 item recommendations. As
mentioned earlier, the Markov model works quite well in
our domain and is adopted as the base recommendation model. We
sample a small percentage of XYZ customers and randomly
split them into two buckets, one with bias learning and
the other without. We run two tests, on 2013/11/26 and
2013/12/07, respectively. Each test sent around 800K marketing
emails. Three widely used metrics are recorded: the
click-through rate (CTR), and the average number of orders and
the average revenue per email open. We attribute an order or
revenue to a marketing email only if the customer receiving the
email clicks on a link in the email and places the order(s) within
the same session. The lifts after bias learning are shown in
Table 6. For all metrics, we see a positive lift, though only the
lift in CTR is statistically significant. Because the number of
orders was extremely small, it is difficult to reject the null
hypothesis given the limited impressions. These online tests
support our hypothesis that adding a bias to capture temporal
dynamics entices more customers to click and subsequently place
orders, suggesting more effective recommendation.
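
One common way to check whether a difference in CTR between two buckets is significant is a two-proportion z-test, sketched below. The choice of this particular test is an assumption made for illustration (the exact procedure is not spelled out above), the function name is hypothetical, and the counts are made-up numbers, not our campaign data.

```python
import math


def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for the difference between two click-through rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p_value


# Made-up counts for a control bucket (a) and a bias-learning bucket (b).
z, p_value = two_proportion_z_test(clicks_a=100, n_a=1000, clicks_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

With very few events per bucket, as is the case for orders, the standard error is large relative to the difference, so even a sizable lift may fail to reach significance.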

Table 6: Lift (%) of online recommendation with bias

Date          CTR     avg #order    avg revenue
2013/11/26    3.35       6.10           5.64
2013/12/07    9.30     108.05         102.76


7. RELATED WORK 

Mining concept-drifting data streams for classification
and pattern mining has been studied extensively.
Considering temporal changes for recommendation, by contrast,
has started to gain attention only recently. Koren et al. proposed
a time-dependent bias in matrix factorization for
movie/music ratings. However, that method is not applicable
to the one-class collaborative filtering problem.
Moreover, it aims to minimize the root mean squared error
(which is differentiable) rather than ranking metrics for
top-k recommendation. On the other hand, Xiong et al.
formulate temporal collaborative filtering as a tensor factorization
by treating time as one additional dimension. Wang
et al. [24, 25] consider the time gap between purchases and
propose an opportunity model to identify not only the items
to recommend, but also the best time to recommend a particular
product. Meanwhile, improving the temporal diversity of
recommendations over time has also been considered.

Another related domain is learning to rank, which was
initially motivated by the problem of information retrieval
given queries. Making recommendations by learning to rank
has attracted a lot of attention recently. EigenRank
extends memory-based (or similarity-based) methods
by considering the ranking (rather than rating) of items
in computing user similarities. Matrix factorization has been
extended to optimize ranking-oriented losses as well. But
most ranking-related metrics are non-smooth or non-convex.
Hence, the majority of the methods either approximate the loss
via a smooth function or find a smooth lower/upper bound
for the loss function. For instance, CofiRank extends
matrix factorization to optimize ranking measures like NDCG
instead of rating measures. Because NDCG is non-convex,
the authors propose a couple of steps to find a convex
upper bound for the non-convex problem and adopt a bundle
method for optimization. CLiMF instead optimizes
a lower bound of a smoothed reciprocal rank. Our proposed
method differs because we explicitly optimize the exact
ranking measure. This is viable because we are learning only
biases, rather than latent factors, with the ranking loss.

Our proposed bias learning method in collaborative filtering
is partly inspired by the thresholding problem in
multi-class/label classification. For large-scale multi-class/label
classification problems, one-vs-rest is still widely
used. That is, for each class we construct a binary classifier
by treating the class as positive, and the remaining classes
as negative. Since each binary classifier is constructed
independently, researchers have proposed learning a threshold (bias)
for each class, mainly to optimize classification accuracy,
precision/recall or F-measure. However, in top-k recommendation,
the score differences and the ranking of items matter, making
all the items dependent on each other. Also, the motivation
of this work is mainly to capture temporal dynamics
rather than to calibrate classifier prediction scores.

8. CONCLUSIONS AND FUTURE WORK 

This work attempts to take into account temporal dynamics
for top-k recommendation. It is motivated by the
observation that certain domains, e.g., eCommerce, are highly
dynamic. Since user feedback is likely to be scarce in most
recommender systems, we suggest keeping as much data as
possible for training the recommendation model to avoid the
sparsity problem. On the other hand, we propose to learn a
time-dependent bias for each item based only on recent user
feedback to capture the temporal trend change. We define
the bias learning problem and present a coordinate-descent-like
algorithm to optimize ranking-based measures like ACC,
MAP or NDCG. We prove that the algorithm is guaranteed
to terminate in a finite number of steps with reasonable time
complexity. Empirical results from both offline and online experiments
demonstrate that the proposed bias learning method is able
to boost the performance of base recommendation models
and capture the temporal shift in user feedback. As
bias learning works independently of the base recommendation
model being used, we encourage other practitioners to add it
as a standard module in recommender systems where temporal
dynamics are the norm.

A couple of problems remain open. First, even though we have
provided some guidelines to reduce the computational cost of
bias learning, the proposed algorithm is sequential in nature
and thus difficult to run on parallel/distributed
computing platforms. Its scalability needs to be improved. Second,
it has been shown that an item can be removed from
recommendation if it does not appear in the recent transactions.
Bias learning therefore would not pick new items. This
seems to run against an initial purpose of recommendation,
namely to encourage users to discover more items in the long tail.
A better understanding is needed of the balance between
relevance, popularity and serendipity in recommendation.